ABSTRACT



This report reviews existing research on the effects of high 
stakes tests on students, particularly students with disabilities. The review 
focuses on potential effects on the curriculum, student learning, attitudes 
and school climate, and the costs versus benefits of high stakes testing of 
students with disabilities. The review finds that research results on high
stakes testing are inconclusive and vary with the type of research questions
asked and the types of tests examined. The evidence suggests that teachers 
change the curriculum based on the tests and concentrate time and effort 
teaching to test content and format. The effects on student learning are 
largely unknown, but the evidence does suggest that increases in test scores do
not in themselves serve as evidence that students are learning more. High
stakes testing seems to have a negative effect on the attitudes and workloads 
of teachers, but little is known about the effects on students. States still 
do not take into account the full costs of high stakes testing programs, and 
claims that testing alone can cause major educational improvement have not 
been proven. Recommendations for future research are provided. An appendix 
summarizes the literature reviewed for the report. (Contains 46 references.)

Executive Summary



Increasingly, schools are administering tests that have important 
consequences for students. This literature review looks at existing research 
on the effects of these high stakes tests on students, with particular attention 
to students with disabilities. The review focuses on potential effects on the 
curriculum, students, student learning, attitudes, school climate, and costs. 

Based on this review, we propose the following recommendations for 
future research and for those people who develop, implement, and evaluate 
large-scale testing programs: 

• Focus on the effects of high stakes testing on students rather than on 
schools and systems. 

• Assess the effects of high stakes testing on the curriculum for both 
special and regular education. 

• Assess the effects of high stakes testing on the dropout rates for 
both special and regular education. 

• Study the effects of high stakes testing programs on students who 
are excluded from testing. 

• Develop assessments that are more inclusive of students with 
disabilities (for example, including students with disabilities in state 
norming samples, and norming tests with some students using 
common accommodations such as extended time and Braille). 

• Study the effects of high stakes testing on the relationship between 
regular and special education. 

• Develop a framework for evaluating the costs versus the benefits of 
high stakes testing programs, particularly for alternative and 
authentic assessments.

High Stakes Testing



Over the past two decades, statewide testing of students has become more and more common. 
This testing is conducted to meet a multitude of purposes and a wide variety of instruments is 
used. The practice of statewide assessment has been the center of a great deal of debate and 
political rhetoric (Popham, 1987; Salganik, 1985; Shepard, 1992). Although some states have
attached high stakes, such as graduation, to tests since the early 1970s (e.g., Florida, New 
York), research findings on the effects of high stakes testing have not been decisive. The 
Office of Technology Assessment (OTA) (1992) summarized the findings by stating: 

In the end, then, there appears to be consensus that innovation in school testing 
policies can have profound effects-the disagreement is over the desirability of 
those effects. Although some of the evidence is contradictory, at times even 
confusing, one thing is clear; test-based accountability is no panacea. Specific 
proposals for tests intended to catalyze school improvement must be scrutinized 
on their individual merits (p. 15). 

Research looking at effects on students, particularly students with disabilities, is the focus of 
this report. In the past, most research has emphasized school improvement and system level 
effects. It is important, therefore, to carefully examine the research for evidence of student 
level effects. 

As the clients of our educational system become more diverse in their characteristics
(Hodgkinson, 1992; Hodgkinson & Outtz, 1992), it becomes increasingly important to look 
more specifically at differential effects of major reforms. Since we know that groups of 
students perform differently (Mullis et al., 1994), testing, particularly high stakes testing that 
may have important consequences for students, has the potential to affect some groups 
differently from other groups. Among the groups of students for whom differential effects 
might be expected are those with disabilities (Olson & Goldstein, 1996). At present there is 
increased pressure for states to use tests to decide whether students will earn a high school 
diploma and to hold all students to the same requirements for graduation (see Thurlow, 
Ysseldyke, & Anderson, 1995). It is important to re-examine what we know about high stakes 
testing and what the research literature tells us. 

What is “High Stakes”? 

A test can be considered high stakes if the results of the test have perceived or real 
consequences for students, staff, or schools (Madaus, 1988). Increasingly, states, cities, and 
school boards are using test scores in order to evaluate schools and allocate resources. In 
October 1996, Chicago put 109 schools on academic probation. According to Hendrie (1996),
scores from nationally normed standardized tests were a chief factor in determining which schools would
be placed on probation. Manzo (1996) reported that Philadelphia was planning to tie teacher
raises and cash awards for schools to student test scores, attendance, and graduation
rates. Schools with chronically low-performing students could be forced to
replace up to three-fourths of their staffs.



The consequences of testing can be both intended and unintended. Corbett and Wilson (1991) 
state: 



Stakes can become high when test results automatically trigger important 
consequences for students or the school system, and also when educators, 
students, or the public perceive that significant consequences accompany test 
results. Thus, a formal trigger of consequences need not be built into the 
testing program for stakes to be high. Instead, test results can cause the public 
to make an assessment of the quality of the school system that serves them, and 
this judgment in turn can lead to a conclusion that children’s choices . . . have 
been affected. The product of this process can be increased public pressure to 
improve test scores, especially when the perception is that the system is likely to 
have a negative impact on those choices (p. 27).

While this review includes research using this broad definition of high stakes, the high stakes 
tests in which we are most interested are those that determine a student’s progress through and 
out of school. Examples are: minimum competency tests (MCTs), graduation exit exams, and 
tests used to decide promotion from grade to grade. 



Literature Review Procedure

This analysis of the literature is based on educational and psychological journals and 
unpublished research from 1980 to the present. The literature search revealed fewer than thirty 
research studies, with only five studies specifically focusing on persons with disabilities. The 
articles include published and unpublished research, state-sponsored evaluation and research 
reports, papers from professional meetings, and articles from various research organizations 
such as CRESST (Center for Research on Evaluation, Standards and Student Testing). 

It became clear, as we reviewed the available evidence, that we would generate more questions 
than answers. The review was limited not only by the lack of research in this area, but also by 
the quite broad uses of testing for high stakes purposes. For example, some of the studies 
focused on elementary schools and the pressure placed on schools to do well for financial or 
other purposes. Other studies focused on high school exit exams and still others focused on 
exams that determine promotion from grade to grade.

Since the articles contained information on a variety of situations that are considered “high
stakes” (and thus are not directly comparable), a meta-analysis of the data was not attempted. 
Instead, a decision was made to conduct a descriptive analysis, and to organize the data as a 
series of research questions, providing summary and analysis of results and directions for 
future research. Most of the studies included here are based on state tests, yet each state has a 
different approach to testing, and each program is evolving. The studies and evaluation 
reports used in this review do not necessarily reflect the particular state’s 
current testing practices. A high stakes testing program in which schools are evaluated 
according to the results on the Iowa Test of Basic Skills is different from a program in which
students must pass a test in order to graduate from high school. While often the studies 
themselves are not directly comparable, we do have enough studies in this review to see 
emerging patterns that require further investigation. A summary of the studies we used in this 
review is provided in the appendix. 

This report is organized around a series of questions. The available literature addresses some 
of these questions, but all are in need of further exploration and consideration by the people 
who make testing decisions. The questions center around four basic effects of high stakes 
testing: effects on the curriculum, effects on student learning, effects on student and teacher 
attitudes and the climate of learning, and the costs versus the benefits of high stakes testing. 
Each section of this review focuses on high stakes testing in general, and then explores the 
implications for persons with disabilities. Directions for future research also are proposed. 



Curriculum and Instruction

Most research on the effects of high stakes testing has looked at the curriculum and instruction. 
Still, the information from this research is relatively indirect. Seven studies that addressed
the effects of high stakes testing on the curriculum and instruction were reviewed (see first
table in the appendix).

What Effect Does High Stakes Testing Have on the Curriculum?

Several claims have been made as to the possible positive and negative effects of high stakes 
testing on the curriculum. Some individuals, such as Popham (1987), believe that the 
curriculum will improve as schools, teachers, and students attempt to meet the challenges that 
testing will impose. Others (Madaus, 1988; Shepard, 1992) fear that high stakes testing will 
narrow the curriculum, focus on lower-order skills, or take control of the curriculum away 
from local sources.

In a study comparing a low-stakes state (Pennsylvania) and a high stakes state (Maryland),
Corbett and Wilson (1990, 1991) found that in a high stakes situation, teachers reported a 
narrowing of the curriculum, but that not all teachers thought this was a bad thing. According 
to Corbett and Wilson (1990), “Maryland school districts focused more directly on improving 
test scores, altered the curriculum to a greater extent, reported more improvement in the 
curriculum, and felt the curriculum had narrowed more than their Pennsylvania colleagues” (p. 
72). They also found that smaller districts were more likely to make greater curriculum and 
instruction adjustments. Teachers did not always think that these changes were necessarily in 
the best interests of the students. In general, high stakes testing affected both the content and 
the sequence of instruction, and efforts to affect test scores directly increased as the testing 
dates approached. 

Rottenberg and Smith (1990) used qualitative interviews in a high stakes elementary setting in a 
state (not identified) that used the Iowa Test of Basic Skills (ITBS). The test was considered 
high stakes because the scores were used in the evaluation of principals and for making 
curriculum decisions. The media also reported scores on the ITBS by school and grade level. 
Rottenberg and Smith found that testing reduced the time available for ordinary instruction. 
Schools were also neglecting material not in the tests, while encouraging the use of 
instructional methods resembling testing, such as multiple-choice exams. 

Shepard and Dougherty (1991) addressed test-preparation practices and effects of testing on 
instruction. They used districts from two states with high stakes testing, one in the Southwest 
and one in the Southeast, and sampled teachers in the 3rd, 5th, and 6th grades. They found 
that teachers gave greater emphasis to basic skills instruction, and that nontested content 
suffered because of the focus on standardized tests. They also found that teachers spent an 
inordinate amount of time preparing for tests rather than focusing on the regular curriculum. 
They reported four weeks of intensive test preparation, plus two weeks for administering the 
test itself. This emphasis on preparation was not limited to the time surrounding the test 
administration; 68% of the teachers reported using worksheets throughout the year to review 
expected test content and to give students practice with the testing formats. Consistent with 
other studies, teachers did report that the tests helped them to set clear instructional goals. 

Herman and Golan (undated) also found that teachers spent an inordinate amount of time 
preparing for tests. They surveyed upper elementary school teachers in nine states with high 
stakes testing using a 136 item questionnaire designed specifically for this study. Consistent 
with Shepard and Dougherty (1991), they found that teachers were spending class time on 
worksheets covering test content and format. Teachers also changed the content and sequence 
of instruction throughout the year. While higher order skills and nontested subjects such as 
arts and science had less emphasis, teachers continued to cover these subjects. Basic skills, 
however, were given the most emphasis. As a result of testing, teachers changed the content
and sequence of instruction based on prior tests and how their classes did the previous year. 
Another trend found in this study was that socioeconomic status (SES) had a great deal to do 
with how well students performed on the tests and how much the tests affected instruction. 
The authors found that socioeconomic status was significantly and negatively related to the 
amount of attention that schools and teachers gave to test scores, curriculum planning, and time 
devoted to test-related activities. According to Herman and Golan (undated), “testing is more 
influential and exerts stronger effects on teaching in schools serving more disadvantaged 
students” (p. 58). 

Rodgers, Paredes, and Mangino (1991) looked at the effects of the Texas Educational 
Assessment of Minimum Skills (TEAMS), a test that students needed to pass in order to 
graduate from high school. The study took place over five years, using 12,404 eleventh grade 
students from the Austin Independent School District. The test focused on language arts and 
math. Rodgers et al. found that basic skills, as measured on the Tests of Achievement and 
Proficiency (TAP), increased as a result of the minimum competency exam, but that higher 
order skills remained the same. They concluded that districts should be cautious about 
narrowing the curriculum and letting higher order skills suffer for the sake of improving test 
scores. 

In summary, the literature provides very little direct evidence about the effects of high stakes 
testing on the curriculum. The results are consistent, however, in showing that high stakes 
testing does affect what and how teachers teach. One can argue about whether these effects are 
positive or negative. For example, Shepard and Dougherty (1991) found that high stakes 
testing narrowed the curriculum; however, in Corbett and Wilson (1990), some teachers 
thought that the curriculum had narrowed while others thought that the curriculum had become 
more focused. Berger and Elson (1996) found that in a high stakes environment, teachers 
reported a clearer mission for their schools. The authors claimed that this supports advocates’ 
claims that measurement driven instruction “adds focus, coherence, and clarity to the mission 
of a school. Teachers and students know what is expected of them and how they will be 
judged” (p. 22). These may all be aspects of the same phenomenon and deserve to be looked 
at more closely. 

How Are Educational Opportunities For Persons With Disabilities 
Affected By High Stakes Testing? 

We could find no studies that directly assessed the effects of high stakes testing on the 
curriculum for students with disabilities. Indeed, some researchers, such as Rodgers et al. 
(1991), purposely excluded students with disabilities from their studies in order to make data 
analysis easier. This section addresses some of the evidence suggesting that educational
opportunities for students with disabilities are negatively affected by high stakes testing. 

Bergquist, Elzie and Groves (undated) conducted an evaluation of the impact and effectiveness 
of Florida graduation and competency test standards on students with disabilities in 1986-87. 
They found that students with disabilities had trouble earning the 24 credits required for a 
standard diploma in four years. This made it difficult for them to accommodate vocational 
training in their programs. Students with disabilities even had trouble earning 24 credits 
toward a special diploma with courses paralleling the standard diploma, and these students also 
had difficulty including vocational training in their programs. The authors concluded that 
students were “more likely to leave high school with a standard or special diploma but no 
marketable job skill” (p. 10). Also, accommodations for students in the academically-oriented 
classes varied from district to district. As a result, children who failed in one district might 
have succeeded in another district that was more willing to meet their needs. The authors 
reported that “districts with greater flexibility in the course requirements for the diploma were 
better able to meet the unique needs of the handicapped student through the course offerings at 
the high school level” (p. 10). 

Little is known about the effects of increased graduation standards on curriculum offerings. 
Grossman, Kirst, and Schmidt-Posner (1986) found that increased graduation requirements 
and entrance requirements for college in California resulted in increased offerings in math and 
science, and decreased offerings in industrial arts, home economics, business education, etc. 
If this is a national trend, then one can expect that as resources decrease and become more 
concentrated on academic subjects, the opportunities for students to find an appropriate 
education may decrease as well. 

Directions 

It is interesting that the research is almost entirely focused on teachers and their
perceptions rather than on students and their performance. For example, do students study 
longer for tests when the stakes are high? If they do, does this generalize to the rest of the 
curriculum, or are students more likely to study only for the test, and what they believe the test 
will include? 

For students with disabilities, it is especially important to assess how high stakes testing affects 
the curriculum. Should students be concentrating on vocational and other objectives, or should 
their IEPs focus on the skills measured on a high stakes test? Educators and legislators must
be made aware that the needs of students with disabilities may not always be best met with 
what is offered in the regular academic curriculum. If students with disabilities are to be 
included in educational reforms, then their needs both for accommodations in testing and
modification of the curriculum must also be taken into account. 

The evidence reviewed so far suggests that high stakes testing affects the curriculum. These 
effects can be viewed as positive or negative, but the negative view is more pervasive. It is
equally, if not more, important, however, to ask what effect these tests have on students and
learning. 



Student Learning

Indeed, questions about the impact of high stakes testing on students with disabilities are the 
critical ones that need answering. To date, however, there is virtually no evidence that 
adequately addresses the question of how high stakes testing affects student learning. Again, 
we turn to the study by Corbett and Wilson (1991) of a high stakes state (Maryland) where 
teachers spent more time on test preparation, used more practice tests, and conducted more 
content reviews. They noted, however, that “while the numbers visually document more 
intense activity on the part of educators in Maryland, missing is any assessment regarding its 
value to improved learning” (p. 70). 

The media have touted many statewide testing programs on the basis that students’ scores on 
the tests have improved over time. In this section, we review studies (see second table in the 
appendix) that suggest this interpretation of test scores as a measure of actual student 
improvement may be premature. Many factors that have nothing to do with actual learning can 
affect scores on high stakes tests. This section addresses several of these factors, including 
teaching to the test, reclassifying students so that they no longer have to take the test (e.g., 
through referral to special education), and increased dropout rates. 

How Well Does High Stakes Testing Measure Student Learning? 

“Teaching to the test” means a concentration on skills that increase test scores regardless of the 
amount of knowledge the student actually possesses. There are some generally accepted skills, 
often called “testwiseness,” that many states openly encourage. Testwiseness includes things 
such as getting a good night’s sleep, knowing how to make educated guesses, using relaxation 
techniques, etc. Other practices are considered unethical and include such things as giving 
students copies of the test prior to administration, hinting at answers, and changing answers on 
test papers. In the middle ground, there exist practices such as giving students worksheets 
(multiple choice, five-minute essays) to familiarize students with the format of the tests. One
extreme example, reported by Madaus (1988), was an entire course dedicated to
five-paragraph timed argumentative essays. This course was being offered in lieu of other
advanced courses in writing. 

Several studies have been conducted to determine whether increases in test scores reflect real 
increases in student learning. Shepard (1990) conducted interviews with teachers in a high 
stakes testing environment to investigate the claim that “teaching to the test” was causing 
inflated test scores. Most instances of teaching to the test were of a mild form, such as using 
commercially marketed programs designed to teach test-taking skills to students. Other 
practices included such things as school-developed practice tests, pep rallies to “psych kids up” 
to do well, and high school courses specifically aimed at the competency measures. This study 
found evidence that questionable practices existed, but could not quantify their effects on actual 
test scores. 

In a different approach, Walstad (1984) looked at what kinds of practices were responsible for 
increases in test scores. Controlling for other factors such as socioeconomic status (SES), 
Walstad looked at three variables: pretesting students, curriculum changes based on the state’s 
education standards, and district-sponsored workshops to increase the skills of teachers in 
implementing the standards. Pretesting, a practice where students were able to practice the test 
format, was the only significant variable that contributed to an increase in test scores. 
Curriculum and instructional changes had no significant impact. This suggests that increases in 
the test scores were not due to actual learning, but rather to familiarity with the tests. 

Another way to determine whether high stakes tests are measuring actual learning is to look at 
how they generalize to other tests. In theory, if a student is learning a skill such as spelling, 
this knowledge should generalize across state tests and other tests of achievement. Koretz, 
Linn, Dunbar, and Shepard (1991) looked at whether performance on a third grade high stakes 
test would generalize to other tests. Mathematics scores did not generalize from one test to 
another, and reading scores generalized only a small amount, of little practical significance. 
Koretz et al. concluded that “to a substantial degree, teachers in this district must be focusing 
on content that is specific to the particular test used for accountability, rather than trying to 
improve achievement in the broader sense that we would all desire” (pp. 20-21). 

Other factors that have nothing to do with learning also may influence test scores. Shepard 
(1992) stated that: 

Because of the pressure on test scores, more hard-to-teach children are rejected 
by the system. There is a direct correspondence between accountability 
pressure and the number of children denied kindergarten entrance, assigned to 
two-year kindergarten programs, referred to special education, made to repeat a 
grade, or who drop out of school (p. 5).

This kind of academic “redshirting,” in which students are kept back a grade in order to improve test
scores (see Zlatos, 1994), has been found in several other studies as well. Potter and Wall
(1992) found evidence that, as early as preschool, children were kept back a grade so that they 
would do better on the tests. Allington and McGill-Franzen (1992a) examined test scores in 
districts that had claimed increases in student performance on high stakes tests. The districts 
came from a variety of settings (urban, suburban, rural) and socioeconomic status. Rather than 
finding evidence of increased learning and better teaching, the authors found an increase in the 
proportion of students retained a grade or placed in special education. The authors recalculated 
the test data by determining which children started kindergarten together. When the scores of
children who had been identified for special education or who had been held back a year were
included in the calculation, the gains districts had been reporting disappeared.
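
The logic of this cohort recalculation can be illustrated with a minimal sketch. Every score and student status below is invented for illustration and is not data from Allington and McGill-Franzen (1992a); the function names are likewise hypothetical. The sketch only shows how an average can rise when low-scoring students leave the tested group, even though no individual score has changed.

```python
from statistics import mean

# Hypothetical data: every score and status below is invented for
# illustration and is NOT taken from Allington and McGill-Franzen (1992a).
# Each pair is (test score, status of the student in that year).
YEAR_1 = [(50, "tested"), (55, "tested"), (60, "tested"),
          (70, "tested"), (80, "tested"), (85, "tested")]

# Same six children, same scores, but the two lowest scorers no longer
# count toward the reported grade-level results.
YEAR_2 = [(50, "special_education"), (55, "retained"), (60, "tested"),
          (70, "tested"), (80, "tested"), (85, "tested")]


def reported_mean(cohort):
    """Mean score of students still counted in the grade-level results."""
    return mean(score for score, status in cohort if status == "tested")


def cohort_mean(cohort):
    """Mean score of every child who started kindergarten together."""
    return mean(score for score, _ in cohort)


reported_gain = reported_mean(YEAR_2) - reported_mean(YEAR_1)
cohort_gain = cohort_mean(YEAR_2) - cohort_mean(YEAR_1)

print(f"Reported gain: {reported_gain:.2f}")
print(f"Whole-cohort gain: {cohort_gain:.2f}")
```

In this invented case the reported average rises by about seven points while the whole-cohort average does not move, which is the pattern the authors describe.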

Allington and McGill-Franzen (1992b) also looked at trends in these schools and found an 
increasing number of students were being referred to special education or retained a grade 
during a period of increased high stakes. 

Similar factors also were found to influence test scores on the Texas minimum competency 
tests of the mid-eighties. Mangino, Battaile, and Washington (1986) found that increased test 
scores were probably due to factors other than actual learning. They identified several problem 
areas, including students taking the tests many times, the ability of students to gain waivers 
from taking the tests, and a higher percentage of students using special education exemptions. 
Thus, a school with incentives to do better on the statewide tests could improve overall scores 
by encouraging low-scoring students to get waivers, or by placing them in special education.

Dropping out of school is perhaps the most serious of the effects of high stakes testing, but it 
has been very difficult to prove a causal connection between high stakes testing and dropout 
rates. Indirect evidence of increased dropout rates has been found by researchers such as 
Potter and Wall (1992) who looked at the effects of a graduation exit exam on students in 
South Carolina. They found that more students were retained a grade as a result of the high 
stakes testing. Overage students were more likely to drop out of school. In a longitudinal 
study from 1982-1989, Morris (1991) looked at patterns of changes in grade retention rates in 
a large urban school district in Florida, where tests were considered high stakes due to pressure 
from the media. Among other factors (such as restructuring the schools from a junior high to a 
middle school model), he found that high stakes testing and increased graduation standards 
increased the retention rates. He found that retention rates increased during the test years when 
the state introduced a diagnostic test to identify failing students for remediation. The increases 
in retention rates spread to other grades when the test was used to identify weak school 
programs. When the graduation requirements increased while the time to achieve the 
requirements decreased, retention rates also rose. 

Another study that attempted to establish a link between high school dropout rates and high
stakes testing was conducted by Catterall (1987). This study looked at states that had 
minimum competency examinations for graduation from high school. Catterall used interviews 
to ask teachers and students their perceptions of the impact of minimum competency testing on 
dropout rates. Few teachers and administrators thought that minimum competency testing had 
much of an impact, and none could cite any evidence to support their beliefs. They saw the 
minimum competency tests as “largely meaningless and innocuous” (p. 22). Students, 
however, had very different perceptions, and 14% of the students said that they knew someone 
who had failed the test and dropped out of school as a result. Catterall also found a high 
correlation between students who had failed the test at least once and who expressed doubt as 
to whether they would finish school. 

Two recent studies have attempted to examine the link between high stakes testing and dropout 
rates directly, with very different results. The studies looked at different populations and types 
of tests. Griffin and Heidorn (1996) reported on the Florida minimum competency tests in 
mathematics and communication, which students must pass in order to graduate from high 
school. Students may take the tests up to five times, starting in grade 10. The authors found 
that dropout rates increased only for students who were doing well academically and 
subsequently failed the tests. Dropout rates did not increase for students who already had poor 
academic records, or for minority students. 

Reardon (1996) found very different results when he examined minimum competency tests that 
students needed to pass in order to be promoted from eighth to ninth grade. This study used 
data from the National Educational Longitudinal Study (NELS) to look specifically at retention 
and dropout rates related to high stakes testing. Interestingly, Reardon found that minimum 
competency testing was more prevalent in urban schools with high concentrations of low- 
income and minority students. 

This uneven distribution of MCT requirements may simply mean that the 
prevalence of MCTs is related to the prevalence of lower-achieving students - the 
group proponents believe the tests are most likely to help. But it raises an 
important concern as well: if MCTs do influence some students to drop out 
who would not have otherwise, then not only are MCT policies harmful, but 
their harmful effects are disproportionately concentrated on those students with 
the fewest opportunities for success (p. 5). 

Reardon did, in fact, find evidence that dropout rates do increase as a result of minimum 
competency testing and that furthermore, “the . . . data also suggests that it is the concentrated 
poverty of these schools and their communities, and their concomitant lack of resources, that 
link MCT policies to higher dropout rates, rather than other risk factors, such as student 
grades, age, attendance, and minority group membership” (p. 5). Reardon found that 
eighth-grade students with minimum competency testing requirements dropped out at double the rate 
of students without MCT requirements (8.8% as opposed to 4.2%). Breaking down the data 
by socioeconomic status, Reardon found that the link to higher dropout rates was strongest in 
low and moderately low SES schools. MCTs had little or no effect on the dropout rates in higher 
SES schools. 

While these results conflict with those of Griffin and Heidorn (1996), these differences can be 
explained by the differences in the type of tests and populations used in the studies. The 
Florida test, on which Griffin and Heidorn base their study, has been described as a basic-level 
test that does not set particularly high standards. In addition, students may take the test up to 
five times, perhaps relieving some of the pressure of the high stakes. The Reardon (1996) 
study, on the other hand, uses a more representative sample, concentrating on high stakes tests 
that have earlier and more immediate consequences for students. 

In summary, increased test scores on a high stakes test do not necessarily translate into 
increased learning for students. The problems go beyond simple inflation of test scores, which 
is a serious concern from a measurement standpoint. These studies point to a frightening but 
very real possibility that children will be systematically and deliberately labeled, excluded, and 
pushed out of the system altogether in order to improve test scores. Madaus (1988) argued that 
teaching to the test is part of human nature. 

Some have argued strongly that if the skills are well chosen, and if the tests 
truly measure them, then coaching is perfectly acceptable. This argument 
sounds reasonable, and in the short term, it may even work. However, it 
ignores a fundamental fact of life: when the teacher’s professional worth is 
estimated in terms of exam success, teachers will corrupt the skills measured by 
reducing them to the level of strategies in which the examinee is drilled .... 

The view that we can coach for the skills apart from the tradition of test 
questions, embodies a staggeringly optimistic view of human nature that 
ignores the powerful pull of self-interest. (p. 93) 

How Does High Stakes Testing Affect the Learning of Students with 
Disabilities? 

Macmillan, Balow, Widaman, and Hemsley (as cited in Griffin and Heidorn, 1996) found that 
students with disabilities who failed a minimum competency exam had higher dropout rates 
than regular education students. This study, however, did not control for other factors that 
might also affect dropout rates, such as academic performance or problem behavior. 

As far as effects on what students with disabilities are actually learning as a result of high 
stakes testing, we do not really know. If researchers have paid little attention to effects on 
typical students, they have paid even less attention to students with disabilities. Even where 
students with disabilities are included in statewide testing programs, they are often left out of 
research and evaluation reports. This is unfortunate, since people with disabilities have much 
to lose when it comes to high stakes testing, especially for graduation purposes. As Safer 
(1980) pointed out, students who lack a regular diploma may be discriminated against in 
employment later in life. 

How Does Inclusion of Students with Disabilities Affect State and 
Local Test Scores? 

Inclusion of students with disabilities in statewide assessments causes problems for many 
persons concerned with the accurate measurement of students. When persons with disabilities 
are not included in norming samples, or accommodations such as large print editions are not 
part of the initial test standardization process, then including these students or allowing 
accommodations raises legitimate questions about the reliability and validity of the test scores. 
According to Bond, Roeber, and Braskamp (1996), 31 states currently use norm-referenced 
tests, such as the Iowa Test of Basic Skills (ITBS) or the Stanford Achievement Test (SAT). 
Scores on these tests are used for improvement of instruction (29 states), program evaluation 
(24 states), school performance reporting (22 states), student diagnosis (19 states), and high 
school graduation (two states). Many times, students with disabilities are left out of these tests 
or denied accommodations because the test was not standardized on these populations. Current 
research efforts are trying to determine whether accommodations such as large print, extra time 
on tests, Braille versions of tests, etc. are a valid means of including students in these 
assessments. Other efforts have been made to determine whether tests can be administered to 
persons with disabilities and reported using separate norms and percentile ranks. Chin-Chance, 
Gronna, and Jenkins (1996) found that the state of Hawaii had enough students taking 
the state’s norm-referenced test to report separate norms and percentile ranks for students with 
and without disabilities and within disability categories. 

Research into testing for students with special needs should not stop at simply determining the 
validity of using a type of accommodation. It is possible to develop tests that are normed on all 
students and that use formats that would include a greater number of people. For example, tests 
can be developed using larger print, so that a separate test in large print need not be developed. 
Test items can be developed keeping in mind that some people may not be able to see the 
figures, but will need to rely on verbal information. 

Criterion-referenced tests, for which a student must simply demonstrate a proficiency in a 
subject, are more easily adaptable for persons with disabilities. Here, the student need only 
demonstrate proficiency in the areas to be tested. For example, in the state of Hawaii, students 
must pass the Hawaii State Test of Essential Competencies (HSTEC) to receive a high school 
diploma. Students who fail the HSTEC once may take the test again until they pass, or may 
pass the test through the Essential Competencies Certification Center that uses an open-ended 
response format. According to Bond et al. (1996), 36 states currently use criterion-referenced 
tests. Sixteen states use criterion-referenced tests for graduation purposes. Thirty-four states 
use written assessments, 12 of them for graduation, and 18 states use alternative forms of 
assessment, two of them for graduation. While inclusion of students with disabilities in these 
types of assessments is more easily accomplished than in norm-referenced tests, it is still 
imperative that states take the needs of students with disabilities into account when they 
develop testing programs in order to avoid costly problems later. 

How Do Students with Disabilities Perform on High Stakes 
Assessments? 

In general, the few studies that have been conducted on students with disabilities show that 
they do poorly when compared to peers without disabilities. In Hawaii, Chin-Chance et al. 
(1996) looked at the scores of students with disabilities taking the Stanford Achievement Test 
8th Edition (SAT8) in the 3rd, 6th, 8th, and 10th grades. Hawaii uses the SAT8 for national, 
school, district, and local comparisons. While the study was not conducted using the state’s 
graduation exit exam (the Hawaii State Test of Essential Competencies), the SAT8 can still be 
considered high stakes because the scores are public, comparisons are made among districts 
and schools, and SAT8 scores are used to make judgments about a school’s effectiveness. The 
study looked at data for students with disabilities who took the statewide administration of the 
SAT8 in 1994 and 1995. They found that students with disabilities in all categories did more 
poorly on the tests than did students without disabilities; however, when looking at difference 
scores from one year to the next, students in Hawaii showed more improvement than did the 
national norms, and students with disabilities did as well as, or better than, students without 
disabilities. This study did not look at how students with disabilities were faring on the 
graduation exit exam, a test with much higher stakes than the SAT8. 

Safer (1980) found that in Florida in the late 1970s, students with disabilities were not likely to 
pass the minimum competency test required for graduation (see Table 1). 

According to the data reported by Safer, between six percent (educable mentally retarded) and 
71% (speech/language impaired) of students passed the communications subtest. Altogether, 
only 46% of the special education students taking the Florida graduation exit exam passed the 
test. Between one percent (educable mentally retarded) and 33% (speech/language impaired) 
passed the math subtest; overall, only 18% passed it. 

Table 1. Percentage of High School Juniors with Disabilities Passing Two 
Subtests of the Florida Minimal Competency Examination in 1977. 

Handicapping Condition         Communications Subtest    Math Subtest      N
Speech/language impaired                 71                   33          509
Deaf                                     47                   18          126
Hard of hearing                          65                   29           49
Physically impaired                      67                   14          110
Emotionally disturbed                    56                   17          114
Socially maladjusted                     49                   25           79
Learning disabled                        49                   17          502
Educable mentally retarded                6                    1          479

From Safer (1980), p. 289 
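
The overall pass rates cited in the text (46% for communications, 18% for math) can be checked against Table 1. The short Python sketch below is our illustration, not part of Safer's analysis. It assumes that the N column gives the number of test takers in each category for both subtests, and that the overall figures are N-weighted averages of the category percentages; the variable and function names are hypothetical.

```python
# Rows transcribed from Table 1 (Safer, 1980):
# (category, % passing communications, % passing math, N)
TABLE_1 = [
    ("Speech/language impaired", 71, 33, 509),
    ("Deaf", 47, 18, 126),
    ("Hard of hearing", 65, 29, 49),
    ("Physically impaired", 67, 14, 110),
    ("Emotionally disturbed", 56, 17, 114),
    ("Socially maladjusted", 49, 25, 79),
    ("Learning disabled", 49, 17, 502),
    ("Educable mentally retarded", 6, 1, 479),
]


def weighted_pass_rate(rows, column):
    """Return the N-weighted mean of a percentage column.

    column 1 = communications subtest, column 2 = math subtest.
    Assumes (our assumption) that N applies to both subtests.
    """
    total_n = sum(row[3] for row in rows)
    return sum(row[column] * row[3] for row in rows) / total_n


total_students = sum(row[3] for row in TABLE_1)
communications_rate = weighted_pass_rate(TABLE_1, 1)
math_rate = weighted_pass_rate(TABLE_1, 2)

print(f"Students in table: {total_students}")
print(f"Communications subtest: {communications_rate:.1f}% passed")
print(f"Math subtest: {math_rate:.1f}% passed")
```

Under these assumptions the weighted averages round to the 46% and 18% reported in the text, which suggests that the overall figures are dominated by the three largest categories (speech/language impaired, learning disabled, and educable mentally retarded).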



McKinney (1983) looked at how students with disabilities performed on the North Carolina 
Minimum Competency test, both with and without accommodations. He found that some 
groups were more likely to benefit from test modifications than others. Still, the probability of 
passing the test was low, especially for students with mild mental retardation, even with 
modifications. Out of 3,043 students taking the MCT in 1978, the reported pass rates (see 
Table 2) were only slightly better than those reported by Safer (1980). Even so, many students 
with disabilities could not pass both subtests, a requirement for receiving a diploma. Of the 
students who failed the test on the first try, 78% took the test a second time. Of these students, 
only 35% passed the reading subtest and 28% passed the math subtest. Students with visual 
impairments had the best retest success rates (72%) and students with mild mental handicaps 
had the lowest retest success rates (21%). 



Table 2. Percent of Students Who Passed the Subtests on the North 
Carolina Minimum Competency Test for Graduation. 

Handicapping Condition               Reading Subtest    Math Subtest
Educable mentally handicapped              12                 7
Learning disabled                          56                47
Visually impaired                          92                88
Hearing impaired                           75                70
Multiply handicapped                       32                28
“Other” handicapping conditions            66                57


From information in McKinney (1983), pp. 547-548. 

McKinney also looked at how test modifications affected student performance on the test. Half 
of the students in the sample received modified tests. Modifications included extended time, 
small group administration, audio cassettes, large print editions, and sign language. Use of 
modifications was not assigned at random, but varied across educational districts and type of 
school personnel (special education teachers were seven to 17 times more likely to use test 
modifications). Students with mild mental retardation who used modifications were more 
likely to pass the test than those who did not receive modifications. Students with hearing 
impairments were actually less likely to pass with modifications. Test modifications had no 
significant impact on other groups. 

In interviews with school personnel, McKinney (1983) found that special educators were 
concerned about the impact of the testing on students with mild mental impairments. Teachers 
reported that “some exceptional students, particularly educable mentally handicapped students 
found the test extremely frustrating” (p. 549). Even though students with mild mental 
disabilities benefited from the accommodations, they were still less likely than any of the other 
groups to pass the tests, and were reportedly more likely to become frustrated by the testing 
process. Hall and Gallagher (as cited in Vitello, Camilli, and Molenaar, 1987) analyzed the 
same data and found that while remediation did make it possible for 50% of the students with 
disabilities retaking the test to pass on the second try, students with mild mental retardation 
were not helped by remediation efforts. 

Vitello, Camilli, and Molenaar (1987) also found that students with mild mental handicaps had 
the most difficulty passing minimum competency exams. Their study examined the scores of 
the 4,299 students with disabilities who took the New Jersey competency test. At this time, 
students with disabilities who completed their IEP objectives were given a standard diploma. 
Of the students with disabilities who were eligible, only 40% actually took the exam, and 12% 
of these students passed. Only one percent (26 of 1,438) of students with mild mental disabilities 
took the exam, and, of this group, only four percent actually passed the test. 

Directions 

The studies reviewed so far focus primarily on factors that may inadvertently influence test 
outcomes. The literature says very little about exactly what students should be learning or 
doing differently as a direct result of testing. More studies, such as that of Catterall (1987), 
which focus specifically on students and how testing reforms affect their lives and learning, are 
also needed. The conflicting results of Griffin and Heidorn (1996) and Reardon (1996) 
indicate that not all testing reforms have the same results. It seems to us that every major 
testing reform requires the type of analysis that Griffin and Heidorn conducted, as well as 
continuing large-scale studies such as Reardon’s, which look at a broader spectrum of tests. 

There is a dearth of evidence about the effects of high stakes testing on students with 
disabilities. If, for example, schools are referring more students to special education as a result 
of high stakes testing, then we need to know whether these students are getting the instruction 
they need, or whether we are simply putting them in positions where less is expected of them. 
The effects of both inclusion and exclusion of students with disabilities should be examined. 
For example, if we exclude students with disabilities, we may then spend less time teaching 
them important skills that the tests cover. If we include students with disabilities, we may then 
be neglecting other parts of their education, such as vocational skills that the IEP team believes 
to be important. 

The exclusion of any group from a high stakes testing program makes it difficult to evaluate the 
effects on these students, particularly when looking at effects on dropout rates. When some 
students are excluded from testing in the first place, an increase in the dropout rate of this 
population would not show up in the results of studies such as Griffin and Heidorn (1996) or 
Reardon (1996). Students with disabilities, particularly learning and behavioral disabilities, are 
already at a particularly high risk for dropping out of school (Sinclair, Christenson, Thurlow, 
& Evelo, 1994). It is possible that exclusion from high stakes testing, especially when the 
result is denial of a diploma, could push even more of these students out of school. 

The evidence so far suggests that students with disabilities do not fare well on minimum 
competency tests. Furthermore, we still do not know whether preparing for the tests, taking 
the tests, or passing the tests have any consequences, positive or negative, for these students. 

It is important for special educators to encourage new testing programs, and large testing 
companies, to begin including students with disabilities in norming samples, and to include 
some common accommodations (such as extended time, reading directions, large print 
editions) when developing tests. This will allow educators to more accurately measure the 
performance of students with disabilities, and increase the inclusion of students with 
disabilities in important assessment activities. 

Thus far, we have examined the influence of high stakes testing in two obvious areas: 
curriculum and instruction, and student learning and performance. In those sections we found 
evidence that teachers focus the curriculum on the content and structure of tests. We also 
found that test scores often are corrupted through artificial inflation due to teaching to the test 
and reclassifying or retaining students who perform poorly so that they do not have to take the 
tests. High stakes testing may have other effects as well, effects that have nothing to do with 
academic performance or test scores. 

Attitudes and School Climate 



This section addresses the less tangible issues of school climate and the attitudes of teachers 
and students toward testing (see the third table in the appendix for a summary of the literature 
reviewed). In this section, we distinguish between two types of high stakes testing: high 
stakes for students, and high stakes for schools and teachers. In most instances, if a test is 
high stakes for one group, it will be high stakes for all groups. Unfortunately, most of the 
research in this area has concentrated on the effects of high stakes testing on teachers and 
schools. High stakes for students include failure to get a diploma or grade promotion. 
Students may also experience high stakes when pressure is put on schools to perform better, 
and the schools in turn put pressure on the students. Schools and teachers may experience 
high stakes in terms of funding (for example, bonuses for good performance, decreases in 
funding for poor performance), public scrutiny by the media and politicians (for example, 
publishing test scores comparing schools or districts), or job security (for example, threatening 
to restructure schools if test performance does not improve). 

What Effects Does High Stakes Testing Have on Teachers’ and 
Students’ Attitudes, and on the Climate of Learning? 

Corbett and Wilson (1991) found that staff in a high stakes state (Maryland) reported greater 
impact on their students’ and their own lives than did staff in a low-stakes state. Teachers in 
the high stakes state also reported more stress, more paperwork, and decreased reliance on 
their professional judgment. 

A qualitative study using classroom observations and interviews of teachers by Rottenberg and 
Smith (1990) found negative effects for both students and teachers in a high stakes testing 
program. They looked at the role of external testing in elementary schools in Arizona. The 
tests used in these schools, such as the Iowa Test of Basic Skills (ITBS), were considered high 
stakes because the results were used in the evaluation of principals and schools, and because 
the media reported ITBS scores by school and grade level. 

For pupils, particularly younger ones, most teachers believe that standardized 
testing is cruel and unusual punishment. Because of the length and difficulty of 
tests, the number of tests, the time limits, the fine print, and the difficulty in 
transferring answers to answer sheets, teachers believe tests cause stress, 
frustration, burn-out, fatigue, physical illness, misbehavior and fighting, and 
psychological distress. Some teachers believe that the tests cause their pupils to 
develop test anxiety and a failure mentality. (p. 17) 

Effects on teachers were perceived to be equally negative: 

[Teachers] feel ashamed and embarrassed if their pupils score low or fail to 
grow by district standards. They feel relieved rather than proud when scores 
are high, for they know that test scores are weighted more by pupils’ 
socioeconomic status and level of effort than anything teachers personally do in 
the classroom. (pp. 17-18) 

Similar effects were found by Herman and Golan (undated), who reported that pressure to 
improve test scores was negatively related to job satisfaction and pride in teaching for upper 
elementary school teachers in nine states across the country. 

A recent study by Berger and Elson (1996) looked at the effects of minimum competency 
testing on teacher autonomy, cooperation and school mission. They used data from the 1987 
Schools and Staffing Survey conducted by the U.S. Department of Education that included 
surveys of over 19,000 teachers across the country. They compared teachers’ responses in 
high stakes testing programs to those in states with low stakes testing programs. A high stakes 
program was defined as one in which the diploma is withheld when the student does not pass. 
As mentioned before, they found that high stakes were correlated with a clearer sense of 
mission for their schools. This generally positive finding, however, was accompanied by a 
reduced sense of autonomy. Contrary to other studies, Berger and Elson did not find a 
correlation between high stakes testing and a reduction in teacher cooperation. This was the 
only study we found that looked at a representative sample of teachers and their attitudes 
toward testing. 

High stakes testing appears to have both positive and negative effects on students and teachers. 
Testing causes stress and frustration for both teachers and, reportedly, for students as well. 
Teachers reported decreased autonomy and ability to rely on their professional judgment. 
Interestingly, an increased sense of clear mission is the one positive attitudinal change to be 
documented from high stakes testing. Absent from this evidence is any direct study of how 
students have been affected both emotionally and academically by testing programs. 

What are the Emotional/Attitudinal Effects on Students with 
Disabilities? 

Especially when it comes to graduation exit exams, one can assume that the effects on students 
with disabilities are at least as serious, if not more so, than on students without disabilities. 
McKinney (1983) discussed the possible repercussions for students with disabilities who, 
according to his study, have a limited probability of passing the test, even with modifications. 
The professional opinions of the teachers involved in the study were that students, especially 
those with mild mental disabilities, would experience frustration. 

Many states leave the decision about the participation of students in high stakes tests up to the 
Individualized Education Program (IEP) team (see Thurlow et al., 1995). These teams need to 
take into account the possible negative effects of both participation and nonparticipation in the 
testing programs. Again, little evidence exists about the real effects that high stakes testing has 
had on students with disabilities. 

Directions 

As in other areas, the research into attitudes and school climate should focus more directly on 
students, particularly students with disabilities. Similarly, given the perceptions of educators, 
it might be beneficial to focus training efforts toward addressing the perceptions of negative 
effects and how to avoid these. 

To date, there has been no research into the effects of high stakes testing on the relationship 
between special and regular education. If stakes for administrators and educators are high (for 
example, linked to promotions, bonuses, sanctions for poor performance, etc.), will regular 
educators see special education students as a threat, bringing down test scores and possibly 
resulting in the loss of funds or even jobs? Does high stakes testing affect the inclusion of 
students with disabilities in the regular classroom? Do special educators see high stakes testing 
as a threat to their autonomy and the ability to individualize the educational programs of 
students with disabilities? Do students with disabilities view high stakes testing as yet another 
barrier to their inclusion as members of our society, or do they see “just another test”? 



Costs Versus Benefits 

So far in this report, we have examined the possible benefit or harm that may occur as a result 
of high stakes testing. In the next section, we will look at the costs versus the benefits of high 
stakes testing and the extent to which costs have been accurately estimated and taken into 
account during the development of high stakes testing programs. 

What are the Costs vs. the Benefits of High-Stakes Testing? 

The issue of costs versus benefits of high stakes testing is not an area that has been thoroughly 
researched. Some advocates claim that tests can cause substantial improvement in schools at 
very little cost, with very little effort on the part of legislators, government, or the public. The 
idea seems to be that students and teachers will work harder to gain the rewards of scoring well 
on the tests, and avoid the consequences of scoring poorly. 

The February 16th Daily Report Card (DRC), an on-line newsletter put out by the National 
Education Goals Panel, reported on a controversy about testing taking place in Virginia. 
According to the DRC, the governor was proposing a new testing program that would be used 
to determine district funding and could also be used to decide whether teachers and 
administrators should keep their jobs. Teachers complained that they were being asked to be 
more accountable and to achieve higher standards without the tools and training to do so. An 
opponent of the proposal was quoted, saying, “Testing is education reform on the cheap. 
That’s the problem. It’s oversimplified.” This section addresses whether testing reforms, 
particularly high stakes testing, are as cost-effective as proponents such as Popham (1987) 
claim. 

According to Popham (1987), “If properly conceived and implemented, measurement-driven 
instruction currently constitutes the most cost-effective way of improving the quality of public 
education in the United States” (p. 679). A properly conceived and constructed test should be 
criterion-referenced, with defensible content containing a manageable number of targets 
designed for instructional illumination. He also provided for ample instructional support so 
that educators can make the best use of the tests, which he sees as “vehicles of instructional 
clarification” (p. 681). It is the task of the persons who design the tests to ensure that the 
content will drive both instruction and the curriculum. These things can all be accomplished 
without other elements of reform, such as better curriculum materials, or better paid and trained 
staff: 



if we were able to replace mediocre instructional materials with more potent, 
empirically proven alternatives, then pupils would surely benefit. Similarly, if 
we were to infuse into our current teaching force a host of well-paid, highly 
skilled teachers, we could surely expect major educational dividends. But such 
strategies, though they are surely effective, are very costly (p. 679). 

Can a test be both “properly conceived and constructed” and still be less costly than other 
education reforms? Anderson (1977) brought up the subject of hidden costs of high stakes 
testing in his background paper prepared for the Minimal Competency Workshops sponsored 
by the Education Commission of the States and the National Institute of Education. Many of 
the hidden costs that Anderson outlined have yet to be taken into account when evaluating the 
testing programs of today. These include such things as test development, test administration, 
development and maintenance of regulatory mechanisms (bureaucracies), and compensatory 
programs to bring students who do not pass the minimum competencies up to the current 
standards. Anderson saw this last area, remediation, as containing the greatest hidden cost, 
and the evidence supports his prediction. 

Potter and Wall (1992), in conducting a study of South Carolina’s minimum competency 
testing program, found modest gains in student performance on minimum competency tests. 
Besides evidence mentioned before, that these gains may be attributable to factors other than 
actual learning (such as keeping children back a grade in order to avoid testing them for another 
year), the remediation efforts to raise the test scores had cost the state over $500 million. After 
spending that much money, the state still had no real evidence that the remediation had done 
much good. 

Foshee, Davis, and Stone (1991) found that many students who in the past had failed a 
minimum competency test had subsequently passed without remediation. The study calls into 
question the cost-effectiveness of providing remediation to students who do not do well on a 
minimal competency exam. 

Singer and Balow (1987), in evaluating California’s proficiency law since 1980, found that 
students did better in regular English class than in remedial classes when retaking a minimum 
competency test. They were concerned by a shift in resources directed at students who were 
not doing well on the test, reducing the availability of advanced courses. 

The evidence suggests that remediation programs are not only expensive but often ineffective 
as well. The actual costs of state-administered high stakes tests are not known, especially as 
compared to the costs of other educational reforms. 

Who Benefits the Most from High Stakes Testing? 

Anderson (1977) predicted that remediation programs would be too expensive, and suggested 
rewarding schools that perform well on the high stakes tests. Compensatory education 
programs, he believed, were emerging in the wrong direction, providing substantial incentive 
for schools to do poorly in order to get increased funds for remediation. Rewarding schools 
for good performance, however, has the added problem of diverting funds to schools that are 
already financially better off and doing well rather than giving the funds to schools that have 
greater needs. Anderson (1977) recommended that states work to reduce the financial 
inequities that exist between school districts as part of the overall effort to impose high stakes 
testing. 

Socioeconomic status can have an impact on who benefits from high stakes testing. Tuma and 
Gifford (1995) found that higher graduation standards (not necessarily accompanied by high 
stakes testing) had affected the number of academic courses that college-bound students were 
taking, but had not affected students who were not college-bound. Potter and Wall (1992) 
found that overage students who had been held back a grade before the years in which testing was to 
occur were more likely to be male and nonwhite. Reardon (1996) found that high stakes tests 
caused increased dropout rates only in schools with lower socioeconomic status. Herman and 
Golan (undated) found that: 

Correlations show that socioeconomic status is significantly and negatively 
related to the following: school attention to test scores, teachers’ attention to 
testing and planning their instruction, and overall time devoted to test 
preparation activities . . . testing is more influential and exerts stronger effects 
on teaching in schools serving more disadvantaged students. (pp. 57-58) 
This study was not able to determine whether the increased effects on schools with lower SES 
resulted in better performance on the tests, so it is difficult to interpret these consequences as 
positive or negative. 

Not all studies found a relationship between SES and performance on high stakes tests. 
Corbett and Wilson (1991) found that SES played a “surprisingly weak” role in explaining 
differences between districts. 

It is very difficult to determine from the available evidence whether high stakes testing has 
different effects on students with low or with high SES. The evidence does suggest, however, 
that we need to take a closer look at this question to make sure that we are not using high stakes 
testing as a means to further inequities that already exist in our educational system. 

What About Portfolio and Authentic Assessments? 

Portfolio and authentic assessments have been gaining popularity (Bond et al., 1996). 
Portfolio assessments are in-depth looks into students’ learning histories. They might 
include all of the assessments that the students take, as well as examples from their classroom 
work and other evidence of learning. Authentic assessments are designed to gain an in-depth 
look at the students’ performance level using tasks that are instructionally relevant to the child 
and that would normally be expected as part of a curriculum. Only recently 
have these types of assessments been used by states for high stakes purposes. These assessments 
have many advantages over the usual multiple choice exam. They give more information about 
students, are potentially more useful to teachers, and measure higher order skills that 
are more difficult to assess with traditional paper and pencil tests. Portfolio and authentic 
assessments have several imposing disadvantages, however. In particular, they take more time 
to develop and implement. In addition, someone has to judge the students’ responses and 
determine whether they meet the educational standards. The reliability of such judgments on a 
large-scale assessment program has yet to be established (Shepard, 1992). 

Vermont was one of the first states to use portfolio assessments on a large-scale basis. 
According to Koretz, McCaffrey, Klein, Bell, and Stecher (1993), the intent of the assessment 
was to encourage high standards and good educational practices while maintaining local 
autonomy. Koretz et al. (1993) evaluated the 1992 Vermont Portfolio Assessment program 
and found disappointing reliability coefficients ranging from .33 to .43 (see Table 3). 



ERIC 



22 



28 



NCEO 



Table 3. Reliability Coefficients for the 1992 Vermont Portfolio Assessment 

                             Grade 4    Grade 8 
Mathematics Best Pieces        .33        .33 
Writing Best Piece             .35        .42 
Writing Remainder              .34        .43 

From Koretz et al. (1993) 



According to Koretz et al. (1993): 

The Vermont portfolio program faces substantial hurdles because of the 
unreliability of scoring documented here. Rater reliability is low enough to 
undermine the utility of 1992 scores for comparing groups of students (schools, 
districts, or other groups). Even when scores are aggregated enough to 
produce estimates with small measurement and sampling error-for example, 
statewide reporting of average scores-low reliability threatens the usefulness for 
gauging trends in performance over time, because it remains uncertain how an 
increase in the reliability of ratings will affect the distribution of scores even if 
true performance remains constant. (p. 18) 

Teachers, however, reported that they liked the portfolio assessments and thought that they 
were a valuable tool in gauging student progress, and many schools had expanded the portfolio 
program beyond the grades required by the state (Koretz et al., 1993, pp. 1-2). 
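
To make coefficients like those in Table 3 concrete, the following is a minimal sketch of one common way to quantify rater agreement: the Pearson correlation between two raters' scores for the same set of portfolios. This is an illustration only. The text above does not specify how Koretz et al. (1993) computed their coefficients, and the scores, the 1-4 scale, and the function name `rater_correlation` are all hypothetical.

```python
from math import sqrt

def rater_correlation(scores_a, scores_b):
    """Pearson correlation between two raters' scores of the same pieces of work.

    Assumes both lists are the same length and neither rater gave
    every piece the identical score (which would make the result undefined).
    """
    n = len(scores_a)
    mean_a = sum(scores_a) / n
    mean_b = sum(scores_b) / n
    # Sum of cross-products of deviations from each rater's mean.
    cross = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    # Sums of squared deviations for each rater.
    ss_a = sum((a - mean_a) ** 2 for a in scores_a)
    ss_b = sum((b - mean_b) ** 2 for b in scores_b)
    return cross / sqrt(ss_a * ss_b)

# Hypothetical scores (1-4 scale) that two raters gave to the same ten portfolios.
rater_a = [1, 2, 2, 3, 3, 4, 2, 1, 3, 4]
rater_b = [2, 1, 3, 3, 2, 3, 4, 2, 1, 4]

r = rater_correlation(rater_a, rater_b)
print(f"Correlation between raters: {r:.2f}")
```

A value near 1 would mean the two raters rank the portfolios almost identically; a value in the range reported in Table 3 means that a student's score depends heavily on which rater happens to read the work.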



An even more ambitious testing effort to assess student learning for purposes of school 
accountability took place in England. Torrance (1993) looked at the effects of an authentic 
assessment program in Great Britain. The assessments used in the U.K. are designed to 
accompany a National Curriculum. The goal is to assess all children in the National 
Curriculum subjects at the ages of 7, 11, 14 and 16 through teacher assessment of course work 
(TA) and “standard assessment tasks” (SATs). Results are assembled into individual scores 
with 10 levels of achievement, with expected progress at one level per two years of schooling. 
Scores are reported to parents by subject and to the public by school. At the time of this study, 
implementation of this plan had begun only with the younger children (seven-year-olds). 

The aspiration was to produce tasks that would contribute to a system of 
assessment that was both formative and evaluative-aiding the learning of pupils 
by providing detailed information to teachers, and . . . providing directly 
comparable data on pupil and school achievement. At the same time, the tasks 
were meant to be as valid and ‘user friendly’ as possible . . . while also guiding 
the implementation of the National Curriculum through exemplifying what it 
should look like in action (examples of tasks include rolling toy cars down 
slopes of different gradients to investigate how far they travel and why; using 
dice in the context of a simple game to test computational skills; drawing and 
labeling a poster to illustrate how and why things grow). Thus, the 1990 pilot 
. . . was launched with the intention of piloting tasks that mirrored good 
primary education practice while at the same time yielding formatively useful 
information and comparable summative results. (p. 83) 

The evaluation of this pilot project attempted to get responses from teachers in an informal, 
open-ended survey. Very few of the participating teachers responded. Of those who did 
respond, most said that they “were so exhausted and disillusioned with the whole business that 
they very nearly did not get in touch-they could hardly face spending yet more time on an 
experience they would rather forget” (p. 84). 

The major complaint that teachers made in their responses was about their increased workload. 
According to Torrance, “the most commonly reported figure was two to three hours of extra 
work every evening for marking, record keeping, gathering resources, and planning the next 
day’s work, plus six hours each weekend. All this was in addition to a full day’s work in 
school beginning well before and ending well after the children’s school day” (p. 85). This 
drain on teacher time had indirect effects on the rest of the school. Teachers of the seven-year- 
olds did not have time for assemblies, playground duties and other activities, leaving the 
burden of these activities to the other teachers in the school. Teachers reported that 
relationships with parents and students also were affected since teachers no longer had the time 
to welcome parents into the classroom, or offer extra help to students outside of class time. 
Other classroom duties were neglected as well, such as changing reading books and 
maintaining regular classroom routines. While students were engaged in the individual and 
small group tasks, the students who were not being assessed showed signs of boredom and 
stress from being ignored and subjected to “busy work.” Torrance asserted, “It is clear, then, 
that teachers did not plan their curriculum, initiate a range of activities, and go on to assess in 
an opportunistic and ‘naturalistic’ fashion, as some of the claims for classroom authenticity 
might lead us to believe. Rather, teachers treated assessment as a special activity, set apart 
from teaching, and they felt obliged to do this by the instructions they received” (p. 85). 

Two major complaints about the implementation of the assessment program emerged from the 
survey. One was that the assessments contained so much new material that teachers felt 
“deskilled and overly dependent.” The second was that the materials were too specific to be 
used in a naturalistic way, yet the specificity was deemed necessary in order to use the data for 
comparative purposes. Teachers reported that the materials were not flexible enough to expand 
upon when students showed genuine interest, and many students were anxious about the 
results rather than interested in the learning. 
The assessment was “trimmed back” in 1991 and 1992, establishing what the author saw as a 
movement back toward paper-and-pencil tests. It is possible that we will see the same trend in 
the United States if testing programs have difficulty establishing reliable assessments. 

One can easily criticize Torrance's study; the self-selected survey sample could certainly have 
skewed the results. However, on April 17, 1996, Education Week reported that the national 
assessments resulted in a major labor dispute in which teachers refused to administer the tests, 
which were seen as cumbersome and unwieldy (Bradley, 1996). 

What Are the Costs Versus the Benefits of High Stakes Testing for 
Students with Disabilities? 

Researchers need to look not only at the inequalities that exist for students of different SES, but 
they also need to look at those that exist for students with and without disabilities. As 
mentioned before, students with disabilities stand a great chance of failing minimum 
competency tests, even when they are given accommodations such as increased time to take the 
tests, interpreters, or large print versions (McKinney, 1983; Safer, 1980). Bergquist et al. 
(undated) mentioned the need to provide increased supports for students with disabilities not 
only to do well on the tests, but also to maintain an appropriate program incorporating their 
need for vocational as well as academic subjects. 

Directions 

Calculating the true costs of implementing statewide high stakes tests can be a daunting task. It 
would be useful for evaluators and researchers to have a framework to estimate the costs of 
testing programs, and to then balance these against the expected and realized benefits of the 
programs. 

We also need to think more about inclusion of students with disabilities when developing any 
testing program. Portfolio assessments may make it easier for students with disabilities to 
participate in statewide testing programs, but it is still not clear whether this type of assessment 
can have the reliability needed to be used for statewide comparisons, graduation requirements, 
or other high stakes purposes, and the costs of such programs may be prohibitive. 



Recommendations and Conclusions 

Special educators need to become more involved in development, implementation and 
evaluation of high stakes testing programs. The inclusion or exclusion of students with 
disabilities is a problem salient to both regular and special education researchers. The 
following recommendations are just some of the many research and evaluation questions that 
should be taken into account in future investigations: 

• Focus on the effects of high stakes testing on students rather than on schools 
and systems. 

• Assess the effects of high stakes testing on the curriculum for both special and 
regular education. 

• Assess the effects of high stakes testing on the dropout rates for both special 
and regular education. 

• Study the effects of high stakes testing programs on students who are excluded 
from testing. 

• Develop assessments that are more inclusive of students with disabilities (for 
example, including students with disabilities in state norming samples, and 
norming tests with some students using common accommodations such as 
extended time and Braille). 

• Study the effects of high stakes testing on the relationship between regular and 
special education. 

• Develop a framework for evaluating the costs versus the benefits of high stakes 
testing programs, particularly for alternative and authentic assessments. 

The research on high stakes testing is inconclusive, and results vary with the types of research 
questions asked and the types of tests examined. The evidence suggests that teachers change 
the curriculum based on the tests, concentrating time and effort teaching to test contents and 
format. For students with disabilities, this may mean that less time is devoted to their 
vocational and other nonacademic needs, even though these students are less likely to pass the test 
even when given accommodations. The effects on student learning are largely unknown, but the 
evidence does suggest that increasing test scores themselves do not serve as evidence that 
students are learning more. Test scores can become inflated through teaching to the test, 
excluding or excusing students who may not perform well, and through increased dropout 
rates. High stakes testing seems to have a negative effect on the attitudes and workloads of 
teachers, but little is known about the effects on students themselves. States still do not take 
into account the full costs of high stakes testing programs, and claims that testing alone can 
cause major educational improvements have not been proven. Authentic and portfolio 
assessments hold promise, but it has not been established that these can be done well, 
efficiently and with sufficient reliability to be used as a large-scale comparative assessment 
method or for high stakes purposes like graduation. 

The effects of high stakes testing on students with disabilities are harder to determine. 
Students with disabilities have not performed well on minimum competency tests, and it is 
unclear what effects other types of tests have had on their educational outcomes, choices and 
futures. The needs of students with disabilities have been largely overlooked in the 
development and evaluation of statewide test-based reforms. 

We do not advocate, as Corbett and Wilson (1991) suggest, that we abandon the practice of 
testing for educational outcomes altogether. It is our belief that schools need to be responsible 
to parents, taxpayers, and the community at large, for student outcomes. The evidence we 
have reviewed in this paper supports this position by demonstrating that exclusion of some 
students leads to abuses and statistical problems that make interpretation of test scores dubious 
at best. The evidence also tells us that testing alone is not a sufficient mechanism for effective 
school reform. The most properly conceived and implemented tests can only be effective if 
they support clear standards, solid curricula, committed and well-trained teachers, in schools 
with the resources and supports necessary to provide a world-class education to all students. 
Without these things, scores may continue to rise, but there is no reason to believe that our 
children will be any better educated or prepared for life than they were before. 